Horovod
References
- #PAPER Horovod: fast and easy distributed deep learning in TensorFlow (Sergeev 2018)
- #CODE https://github.com/horovod/horovod
- https://horovod.readthedocs.io/en/latest/keras.html
- https://horovod.readthedocs.io/en/stable/tensorflow.html
- https://eng.uber.com/horovod/
- Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet whose goal is to make distributed deep learning fast and easy to use. It is hosted by the LF AI Foundation (Linux Foundation AI). Horovod inserts allreduce operations into the back-propagation step to average the computed gradients across workers, which is what allows training to scale over multiple GPUs. It is based on Baidu's ring allreduce (http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/); see the sketch after this list.
- [Not straightforward from JupyterLab](https://github.com/horovod/horovod/issues/622); a possible workaround is ipyparallel.
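As a concrete illustration of the allreduce-based gradient averaging described above, here is a minimal sketch using Horovod's Keras API, assuming TensorFlow 2.x. The dataset, model, and hyperparameters are placeholders for illustration, not taken from the paper:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one Horovod process per GPU

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = (tf.data.Dataset.from_tensor_slices(
               (x_train[..., tf.newaxis] / 255.0, y_train))
           .shard(hvd.size(), hvd.rank())  # each worker gets a distinct shard
           .shuffle(10000)
           .batch(128))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers, then wrap the optimizer;
# the wrapper averages gradients across workers via allreduce during backprop.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

callbacks = [
    # Broadcast rank 0's initial weights so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(dataset, epochs=3, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)  # log only on rank 0
```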
Examples
- https://github.com/horovod/horovod/tree/master/examples
- https://horovod.readthedocs.io/en/stable/running_include.html
- https://github.com/horovod/tutorials/blob/master/fashion_mnist/README.md
- Distributed Deep Learning with Horovod (Jordi Torres)
- https://towardsdatascience.com/a-quick-guide-to-distributed-training-with-tensorflow-and-horovod-on-amazon-sagemaker-dae18371ef6e
- With the SLURM workload manager, e.g. on the BSC-P9 cluster. See the paper Ramirez-Gargallo 2019 in AI/Data Engineering/Distributed DL, and the batch-script sketch after this list.
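For reference, a Horovod job is normally launched with `horovodrun`, e.g. `horovodrun -np 4 python train.py` for four local processes. Under SLURM, `srun` typically takes over process launching, starting one Horovod process per task. Below is a hypothetical batch-script sketch; the node/GPU counts, module name, and `train.py` are placeholders to adapt to the cluster at hand:

```bash
#!/bin/bash
# Hypothetical SLURM script for a 2-node, 4-GPUs-per-node Horovod job.
#SBATCH --job-name=hvd-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # one task (Horovod process) per GPU
#SBATCH --gres=gpu:4
#SBATCH --time=01:00:00

module load openmpi            # site-specific; an assumption here

# srun launches one process per task; Horovod picks up the MPI
# environment, so no explicit host list is needed.
srun python train.py
```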